-
Notifications
You must be signed in to change notification settings - Fork 25.6k
Correct categorization analyzer in ES|QL categorize #117695
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Pinging @elastic/ml-core (Team:ML) |
.../src/test/java/org/elasticsearch/compute/aggregation/blockhash/CategorizeBlockHashTests.java
Show resolved
Hide resolved
| COUNT():long | category:keyword | ||
| 1 | .*?\.\*\?Connected\.\+\?to\.\*\?.*? | ||
| 1 | .*?\.\*\?Connection\.\+\?error\.\*\?.*? | ||
| 1 | .*?\.\*\?Disconnected\.\*\?.*? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The categorization analyzer is more aggressive w.r.t. specials chars, hence these changes.
| ; | ||
|
|
||
| COUNT():long | category:keyword | ||
| 2 | .*?DB.+?servers.*? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DB is parsed as a hex number and dropped. Obviously suboptimal, but unrelated to this PR.
...te/src/main/java/org/elasticsearch/compute/aggregation/blockhash/CategorizeRawBlockHash.java
Show resolved
Hide resolved
ef76db0 to
7f61b4e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @jan-elastic ! I have mostly minor comments; the only comment I care a bit more about is whether we need to pass the analysis registry as far down as we currently do, but even so this could be merged as-is.
...te/src/main/java/org/elasticsearch/compute/aggregation/blockhash/CategorizeRawBlockHash.java
Show resolved
Hide resolved
.../src/test/java/org/elasticsearch/compute/aggregation/blockhash/CategorizeBlockHashTests.java
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: To show that the analysis registry is used correctly, could we maybe add one more test that uses a different analysis registry that leads to a different categorization?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure how to write such test. How do I construct a different analysis registry, that contains different versions of the necessary filters/tokenizers. Sounds like quite some effort with low reward.
| 1 | .*?😊👍🏽.+?detcennocsiD.*? | ||
| 3 | .*?😊👍🏽.+?ot.+?detcennoC.*? | ||
| 3 | .*?😊👍🏽.+?rorre.+?noitcennoC.*? | ||
| 1 | .*?detcennocsiD.*? | ||
| 3 | .*?ot.+?detcennoC.*? | ||
| 3 | .*?rorre.+?noitcennoC.*? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ivancea , this change will solve the recent failing tests, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, pretty sure it does
| enrichLookupService, | ||
| lookupFromIndexService, | ||
| new EsPhysicalOperationProviders(contexts) | ||
| new EsPhysicalOperationProviders(contexts, searchService.getIndicesService().getAnalysis()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we could actually already build the categorizer here, to avoid unnecessary dependencies on the analysis registry?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As said in a different comment: in the future the analyzer should be configurable, and to achieve I think that it should be wired like this. (I agree that for now we could have created the fixed analyzer here)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not so sure anymore, see #117695 (comment)
7f61b4e to
8ebf21e
Compare
* Correct categorization analyzer in ES|QL categorize * close categorizer if constructing analyzer fails * Rename capability CATEGORIZE_V4 * add comments
No description provided.